Get the data:
from sklearn import datasets
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import plotly
plotly.offline.init_notebook_mode()
iris = datasets.load_iris(as_frame=True)
y = iris.target_names[iris.target] == 'virginica'
target_names = {0: 'non-virginica',
1: 'virginica'}
target = np.where(y, 1, 0)
data = iris.data
df_iris = pd.DataFrame(data, columns = iris.feature_names)
df_iris['target'] = target
df_iris['target_names'] = df_iris['target'].map(target_names)
df_iris
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target | target_names |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 | non-virginica |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 | non-virginica |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 | non-virginica |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 | non-virginica |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 | non-virginica |
| ... | ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | 1 | virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | 1 | virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | 1 | virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | 1 | virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | 1 | virginica |
150 rows × 6 columns
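As a quick sanity check on the relabeled binary target, the class balance can be confirmed directly. A minimal standalone sketch (it reloads the dataset rather than reusing the variables above):

```python
from sklearn import datasets
import numpy as np

iris = datasets.load_iris()
# Binary relabeling: 1 = virginica, 0 = everything else
target = np.where(iris.target_names[iris.target] == 'virginica', 1, 0)

values, counts = np.unique(target, return_counts=True)
print(counts.tolist())  # [100, 50] -- a 2:1 class imbalance
```

The 2:1 imbalance is worth keeping in mind when splitting the data later.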
Descriptive Statistics for Non-Virginica:
desc_non_virginica = df_iris[df_iris['target_names'] == 'non-virginica'].describe()
desc_non_virginica
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| count | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.0 |
| mean | 5.471000 | 3.099000 | 2.861000 | 0.786000 | 0.0 |
| std | 0.641698 | 0.478739 | 1.449549 | 0.565153 | 0.0 |
| min | 4.300000 | 2.000000 | 1.000000 | 0.100000 | 0.0 |
| 25% | 5.000000 | 2.800000 | 1.500000 | 0.200000 | 0.0 |
| 50% | 5.400000 | 3.050000 | 2.450000 | 0.800000 | 0.0 |
| 75% | 5.900000 | 3.400000 | 4.325000 | 1.300000 | 0.0 |
| max | 7.000000 | 4.400000 | 5.100000 | 1.800000 | 0.0 |
Descriptive Statistics for Virginica:
desc_virginica = df_iris[df_iris['target_names'] == 'virginica'].describe()
desc_virginica
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| count | 50.00000 | 50.000000 | 50.000000 | 50.00000 | 50.0 |
| mean | 6.58800 | 2.974000 | 5.552000 | 2.02600 | 1.0 |
| std | 0.63588 | 0.322497 | 0.551895 | 0.27465 | 0.0 |
| min | 4.90000 | 2.200000 | 4.500000 | 1.40000 | 1.0 |
| 25% | 6.22500 | 2.800000 | 5.100000 | 1.80000 | 1.0 |
| 50% | 6.50000 | 3.000000 | 5.550000 | 2.00000 | 1.0 |
| 75% | 6.90000 | 3.175000 | 5.875000 | 2.30000 | 1.0 |
| max | 7.90000 | 3.800000 | 6.900000 | 2.50000 | 1.0 |
Histograms for Iris Features:
plt.figure(figsize=(12, 8))
plt.subplot(2, 2, 1)
sns.histplot(data=df_iris, x='sepal length (cm)', hue="target_names", palette="bright", alpha=0.6, hue_order= ['virginica', 'non-virginica'])
plt.title('Sepal Length')
plt.subplot(2, 2, 2)
sns.histplot(data=df_iris, x='sepal width (cm)', hue="target_names", palette="bright", alpha=0.6, hue_order= ['virginica', 'non-virginica'])
plt.title('Sepal Width')
plt.subplot(2, 2, 3)
sns.histplot(data=df_iris, x='petal length (cm)', hue="target_names", palette="bright", alpha=0.6, hue_order= ['virginica', 'non-virginica'])
plt.title('Petal Length')
plt.subplot(2, 2, 4)
sns.histplot(data=df_iris, x='petal width (cm)', hue="target_names", palette="bright", alpha=0.6, hue_order= ['virginica', 'non-virginica'])
plt.title('Petal Width')
plt.tight_layout(rect=[0, 0.05, 1, 0.95])
Correlation matrix between the four features:
df_iris_corr = df_iris.copy()
df_iris_corr = df_iris_corr.drop('target_names', axis = 1)
#we needed to drop the names in order to make a correlation matrix
corr_values = df_iris_corr.corr().round(2)
sns.heatmap(corr_values, cmap="RdBu", annot=True, linewidth=.7)
The graphs given below are taken from this kaggle notebook:
https://www.kaggle.com/code/harshalgadhe/iris-species-prediction-using-ml-techniques/notebook
def scatterplot(X, Y):
    sns.scatterplot(data=df_iris, x=X, y=Y, hue='target_names')
    plt.title(X + " vs " + Y)
    plt.grid()
    plt.show()
plt.figure(figsize=(10,6))
scatterplot('sepal width (cm)','sepal length (cm)')
plt.figure(figsize=(10,6))
scatterplot('petal width (cm)','petal length (cm)')
plt.tight_layout()
Insight:
From the sepal width vs. sepal length plot above, the two classes are hard to separate for sepal lengths between roughly 5.5 and 6.5 cm. The correlation matrix showed that petal length and petal width correlate more strongly with the target, and the second plot bears this out: samples with petal width greater than 1.5 cm and petal length greater than 5.3 cm are clearly virginica.
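The threshold claim can be checked against the data itself. A minimal standalone sketch (it reloads the dataset rather than reusing df_iris):

```python
from sklearn import datasets

iris = datasets.load_iris(as_frame=True)
df = iris.data.copy()
df['virginica'] = (iris.target_names[iris.target] == 'virginica').astype(int)

# Region highlighted in the scatter plot: both petal measurements large
region = df[(df['petal width (cm)'] > 1.5) & (df['petal length (cm)'] > 5.3)]
print(len(region), region['virginica'].mean())  # fraction of virginica in the region is 1.0
```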
sns.pairplot(df_iris_corr,hue='target',palette='deep')
#Target 0-> non-virginica
# 1-> virginica
Insights:
This pairplot shows each feature plotted against every other feature, with the per-class distributions on the diagonal. In the petal length panel, classification is hardest in the 4.5-4.8 cm range, where the two class distributions overlap.
Next graph is taken from:
https://www.kaggle.com/code/abdmental01/iris-flower-eda-ml-beginner
import plotly.express as px
custom_colors = ['#1f77b4', 'skyblue', 'cornflowerblue']
# Create a box plot to show the relationship between a numerical variable and a categorical variable
fig_boxplot = px.box(df_iris, x="target_names", y="petal width (cm)", title="Box Plot of Petal Width (cm) by Species",
color='target_names', # Set box color based on Species
color_discrete_sequence=custom_colors # Custom color sequence
)
# Update layout
fig_boxplot.update_layout(xaxis_title="Species", yaxis_title="Petal Width (cm)")
# Show the box plot
fig_boxplot.show()
Insights:
This box plot shows the distribution of petal width for each class; hovering over a box reveals its minimum, maximum, median, and quartiles.
In this dataset, every sample with petal width greater than 1.8 cm is virginica, and every sample with petal width less than 1.4 cm is non-virginica.
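These box-plot thresholds can also be verified against the data. A minimal standalone sketch:

```python
from sklearn import datasets

iris = datasets.load_iris(as_frame=True)
df = iris.data.copy()
df['virginica'] = (iris.target_names[iris.target] == 'virginica').astype(int)

wide = df[df['petal width (cm)'] > 1.8]    # all virginica in this dataset
narrow = df[df['petal width (cm)'] < 1.4]  # all non-virginica in this dataset
print(wide['virginica'].mean(), narrow['virginica'].mean())  # 1.0 0.0
```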
#Split data into training and temp sets first
X_train, X_temp, y_train, y_temp = train_test_split(df_iris_corr, target, train_size=120, random_state=42)
# Split the temp set into validation and test sets
X_validate, X_test, y_validate, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
print("Training: ",X_train.shape[0])
print("Validate: ",X_validate.shape[0])
print("Test:", X_test.shape[0])
Training:  120
Validate:  15
Test: 15
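One caveat with a plain random split on imbalanced classes: the 2:1 class ratio can drift in small splits. A hedged alternative (not what this notebook uses) is to pass `stratify` so every split keeps the original ratio:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
import numpy as np

iris = datasets.load_iris(as_frame=True)
X = iris.data
y = np.where(iris.target_names[iris.target] == 'virginica', 1, 0)

# stratify=y preserves the 2:1 class ratio in each split
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, train_size=120, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)

print(y_train.mean(), y_val.mean(), y_test.mean())  # each ~= 1/3
```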
from sklearn.linear_model import LogisticRegression
#I chose this feature order based on the correlation matrix:
#features are prioritized by their absolute correlation with the target
features = ['petal width (cm)', 'petal length (cm)', 'sepal length (cm)','sepal width (cm)']
log_reg_1 = LogisticRegression(random_state=42)
log_reg_1.fit(X_train[features[0]].values.reshape(-1,1), y_train)
log_reg_2 = LogisticRegression(random_state=42)
log_reg_2.fit(X_train[features[:2]], y_train)
log_reg_3 = LogisticRegression(random_state=42)
log_reg_3.fit(X_train[features[:3]], y_train)
log_reg_4 = LogisticRegression(random_state=42)
log_reg_4.fit(X_train[features[:4]], y_train)
#Took help from class notebook
#https://github.com/HamnaAshraf1/CSCN8010/blob/main/class_notebooks/logistic_regression/logistic_regression.ipynb
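The feature ordering used above can be derived directly from the data. A minimal standalone sketch that reproduces it:

```python
from sklearn import datasets
import numpy as np

iris = datasets.load_iris(as_frame=True)
df = iris.data.copy()
df['target'] = np.where(iris.target_names[iris.target] == 'virginica', 1, 0)

# Rank features by absolute correlation with the binary target
order = df.corr()['target'].drop('target').abs().sort_values(ascending=False)
print(list(order.index))
```

This reproduces the order petal width, petal length, sepal length, sepal width.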
pred1 = log_reg_1.predict(X_validate[features[0]].values.reshape(-1,1))
predict_prob1 = log_reg_1.predict_proba(X_validate[features[0]].values.reshape(-1,1))[:,1]
pred2 = log_reg_2.predict(X_validate[features[:2]])
predict_prob2 = log_reg_2.predict_proba(X_validate[features[:2]])[:,1]
pred3 = log_reg_3.predict(X_validate[features[:3]])
predict_prob3 = log_reg_3.predict_proba(X_validate[features[:3]])[:,1]
pred4 = log_reg_4.predict(X_validate[features[:4]])
predict_prob4 = log_reg_4.predict_proba(X_validate[features[:4]])[:,1]
#Took help from class_notebook to see how we can use predict and predict_proba
#https://github.com/HamnaAshraf1/CSCN8010/blob/main/class_notebooks/logistic_regression/logistic_regression.ipynb
data1 = {
'Instance Number': np.arange(1, len(y_validate) + 1),
'Probability of Predicting Virginica': predict_prob1,
'Model Prediction': pred1,
'Ground Truth': y_validate
}
table1 = pd.DataFrame(data1)
table1
| | Instance Number | Probability of Predicting Virginica | Model Prediction | Ground Truth |
|---|---|---|---|---|
| 0 | 1 | 0.007232 | 0 | 0 |
| 1 | 2 | 0.004901 | 0 | 0 |
| 2 | 3 | 0.925283 | 1 | 1 |
| 3 | 4 | 0.925283 | 1 | 1 |
| 4 | 5 | 0.350797 | 0 | 0 |
| 5 | 6 | 0.636206 | 1 | 1 |
| 6 | 7 | 0.636206 | 1 | 1 |
| 7 | 8 | 0.198051 | 0 | 0 |
| 8 | 9 | 0.003318 | 0 | 0 |
| 9 | 10 | 0.003318 | 0 | 0 |
| 10 | 11 | 0.925283 | 1 | 1 |
| 11 | 12 | 0.792805 | 1 | 1 |
| 12 | 13 | 0.004901 | 0 | 0 |
| 13 | 14 | 0.893300 | 1 | 1 |
| 14 | 15 | 0.198051 | 0 | 0 |
data2 = {
'Instance Number': np.arange(1, len(y_validate) + 1),
'Probability of Predicting Virginica': predict_prob2,
'Model Prediction': pred2,
'Ground Truth': y_validate
}
table2 = pd.DataFrame(data2)
table2
| | Instance Number | Probability of Predicting Virginica | Model Prediction | Ground Truth |
|---|---|---|---|---|
| 0 | 1 | 0.000014 | 0 | 0 |
| 1 | 2 | 0.000015 | 0 | 0 |
| 2 | 3 | 0.998655 | 1 | 1 |
| 3 | 4 | 0.897028 | 1 | 1 |
| 4 | 5 | 0.210752 | 0 | 0 |
| 5 | 6 | 0.585651 | 1 | 1 |
| 6 | 7 | 0.936985 | 1 | 1 |
| 7 | 8 | 0.150068 | 0 | 0 |
| 8 | 9 | 0.000009 | 0 | 0 |
| 9 | 10 | 0.000009 | 0 | 0 |
| 10 | 11 | 0.870248 | 1 | 1 |
| 11 | 12 | 0.782899 | 1 | 1 |
| 12 | 13 | 0.000009 | 0 | 0 |
| 13 | 14 | 0.952741 | 1 | 1 |
| 14 | 15 | 0.016507 | 0 | 0 |
data3 = {
'Instance Number': np.arange(1, len(y_validate) + 1),
'Probability of Predicting Virginica': predict_prob3,
'Model Prediction': pred3,
'Ground Truth': y_validate
}
table3 = pd.DataFrame(data3)
table3
| | Instance Number | Probability of Predicting Virginica | Model Prediction | Ground Truth |
|---|---|---|---|---|
| 0 | 1 | 0.000015 | 0 | 0 |
| 1 | 2 | 0.000011 | 0 | 0 |
| 2 | 3 | 0.997968 | 1 | 1 |
| 3 | 4 | 0.874726 | 1 | 1 |
| 4 | 5 | 0.214948 | 0 | 0 |
| 5 | 6 | 0.593969 | 1 | 1 |
| 6 | 7 | 0.930751 | 1 | 1 |
| 7 | 8 | 0.174016 | 0 | 0 |
| 8 | 9 | 0.000011 | 0 | 0 |
| 9 | 10 | 0.000011 | 0 | 0 |
| 10 | 11 | 0.827581 | 1 | 1 |
| 11 | 12 | 0.759765 | 1 | 1 |
| 12 | 13 | 0.000009 | 0 | 0 |
| 13 | 14 | 0.952516 | 1 | 1 |
| 14 | 15 | 0.017705 | 0 | 0 |
data4 = {
'Instance Number': np.arange(1, len(y_validate) + 1),
'Probability of Predicting Virginica': predict_prob4,
'Model Prediction': pred4,
'Ground Truth': y_validate
}
table4 = pd.DataFrame(data4)
table4
| | Instance Number | Probability of Predicting Virginica | Model Prediction | Ground Truth |
|---|---|---|---|---|
| 0 | 1 | 0.000009 | 0 | 0 |
| 1 | 2 | 0.000006 | 0 | 0 |
| 2 | 3 | 0.998534 | 1 | 1 |
| 3 | 4 | 0.873922 | 1 | 1 |
| 4 | 5 | 0.207005 | 0 | 0 |
| 5 | 6 | 0.572730 | 1 | 1 |
| 6 | 7 | 0.946564 | 1 | 1 |
| 7 | 8 | 0.170670 | 0 | 0 |
| 8 | 9 | 0.000008 | 0 | 0 |
| 9 | 10 | 0.000008 | 0 | 0 |
| 10 | 11 | 0.820222 | 1 | 1 |
| 11 | 12 | 0.728198 | 1 | 1 |
| 12 | 13 | 0.000004 | 0 | 0 |
| 13 | 14 | 0.956114 | 1 | 1 |
| 14 | 15 | 0.016196 | 0 | 0 |
accuracy_model_1 = accuracy_score(table1['Ground Truth'], table1['Model Prediction'])
accuracy_model_2 = accuracy_score(table2['Ground Truth'], table2['Model Prediction'])
accuracy_model_3 = accuracy_score(table3['Ground Truth'], table3['Model Prediction'])
accuracy_model_4 = accuracy_score(table4['Ground Truth'], table4['Model Prediction'])
print(f"Accuracy of Model 1: {accuracy_model_1:.2f}")
print(f"Accuracy of Model 2: {accuracy_model_2:.2f}")
print(f"Accuracy of Model 3: {accuracy_model_3:.2f}")
print(f"Accuracy of Model 4: {accuracy_model_4:.2f}")
#I took help from ChatGPT for this step; I was not sure which metric to choose
Accuracy of Model 1: 1.00
Accuracy of Model 2: 1.00
Accuracy of Model 3: 1.00
Accuracy of Model 4: 1.00
Here we compare each model's predictions against the actual target values on the validation set. In our case, all four models classified every instance correctly.
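Accuracy alone cannot separate the four models here, since all are tied at 1.00 on only 15 validation rows. A confusion matrix with precision and recall gives a finer-grained view; a small sketch on hypothetical labels (not the notebook's actual arrays):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score
import numpy as np

# Hypothetical labels for illustration only
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 1, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)                   # 3 1 1 3
print(precision_score(y_true, y_pred))  # tp / (tp + fp) = 0.75
print(recall_score(y_true, y_pred))     # tp / (tp + fn) = 0.75
```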
decision_boundary = float(-log_reg_1.intercept_[0] / log_reg_1.coef_[0][0])  #scalar x where w*x + b = 0
plt.axvline(x=decision_boundary, color='red', linestyle='-', label='Decision Boundary')
plt.xlabel(features[0])
plt.show()
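The single-feature boundary above is the x-value where the model is maximally uncertain. A standalone sketch that verifies this (refit on the full dataset, not the notebook's exact split):

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris(as_frame=True)
X = iris.data[['petal width (cm)']].values
y = np.where(iris.target_names[iris.target] == 'virginica', 1, 0)

model = LogisticRegression(random_state=42).fit(X, y)

# With one feature the boundary solves w*x + b = 0, i.e. x = -b/w
boundary = float(-model.intercept_[0] / model.coef_[0][0])

# At that x, the predicted probability of virginica is exactly 0.5
p = model.predict_proba([[boundary]])[0, 1]
print(round(boundary, 3), round(p, 6))
```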
decision_boundary_x1 = np.linspace(X_validate[features[0]].min(), X_validate[features[0]].max(), 10)
decision_boundary_x2 = -(log_reg_2.intercept_ / log_reg_2.coef_[0][1]) - ((log_reg_2.coef_[0][0] / log_reg_2.coef_[0][1]) * decision_boundary_x1)
plt.plot(decision_boundary_x1, decision_boundary_x2)
plt.xlabel(features[0])
plt.ylabel(features[1])
plt.show()
x1, x2 = np.meshgrid(np.linspace(X_validate[features[0]].min(), X_validate[features[0]].max(), 10), np.linspace(X_validate[features[1]].min(), X_validate[features[1]].max(), 10))
decision_boundary_x3 = -(log_reg_3.intercept_ / log_reg_3.coef_[0][2]) - ((log_reg_3.coef_[0][0] / log_reg_3.coef_[0][2]) * x1) - ((log_reg_3.coef_[0][1] / log_reg_3.coef_[0][2]) * x2)
import plotly.graph_objects as go
surface = go.Surface(x=x1, y=x2, z=decision_boundary_x3, colorscale='Reds', opacity=0.5, showscale=False)
fig = go.Figure(data=[surface])
fig.update_layout(
scene=dict(
xaxis_title=features[0],
yaxis_title=features[1],
zaxis_title=features[2]
),
title='Decision Boundary (Plane)'
)
fig.show()
failed = np.where(pred1 != y_validate)
print("Failure encountered at:",failed[0])
Failure encountered at: []
For the validation set, no misclassified instances were found.
While experimenting, I noticed that if the features were not ordered by their correlation with the target, some of the smaller models misclassified a few validation instances.
Best model recommendation:
To choose the best model, we first compare accuracy scores, but here all four models are tied at 1.00. We therefore look at the predicted probabilities instead.
Model 4 assigns the most confident probabilities, closest to 0 for non-virginica and to 1 for virginica, compared to the other models.
Therefore, we use model 4, which contains all four features, for the final prediction.
# test_acc1 = accuracy_score(y_test, log_reg_1.predict(X_test[features[0]].values.reshape(-1,1)))
# test_acc2 = accuracy_score(y_test, log_reg_2.predict(X_test[features[:2]]))
# test_acc3 = accuracy_score(y_test, log_reg_3.predict(X_test[features[:3]]))
test_acc4 = accuracy_score(y_test, log_reg_4.predict(X_test[features[:4]]))
print(f"Accuracy of Model 4: {test_acc4:.2f}")
# print(f"Accuracy of Model 1: {test_acc1:.2f}")
# print(f"Accuracy of Model 2: {test_acc2:.2f}")
# print(f"Accuracy of Model 3: {test_acc3:.2f}")
Accuracy of Model 4: 1.00